DOCUMENT RESUME

ED 074 150                                                  TM 002 515

AUTHOR        Millman, Jason
TITLE         Psychometric Characteristics of Performance Tests of
              Teaching Effectiveness.
PUB DATE      Feb 73
NOTE          12p.; Paper presented at annual meeting of the
              American Educational Research Association (New
              Orleans, Louisiana, February 25-March 1, 1973)

EDRS PRICE    MF-$0.65 HC-$3.29
DESCRIPTORS   *Effective Teaching; *Performance Tests;
              *Psychometrics; Scoring; Speeches; Teacher
              Evaluation; Technical Reports; Testing; Test
              Reliability; Test Validity

ABSTRACT 

Teaching performance tests are measures which assess
a teacher's ability to accomplish prespecified instructional
objectives. Although such assessment devices possess much face
validity, little psychometric information is available about them.
Three separate studies were conducted to provide information about
the validity, reliability, administration, and scoring of performance
tests of teaching effectiveness. (Author)






Psychometric Characteristics of Performance Tests
of Teaching Effectiveness¹,²

Jason Millman
Cornell University



The importance of a valid way of measuring teaching effectiveness
is obvious. The performance test approach described in the previous paper
by W. James Popham has much appeal because of its face validity:
effectiveness is measured by changes in student behavior. Optimum use of
performance tests of teaching effectiveness as criteria and dependent
variables, however, requires much more information than presently exists
about the psychometric characteristics of these measures. The purpose of
this paper is to describe the results of three investigations designed to
get some information on such characteristics.



¹Paper presented at the annual meeting of the American Educational Research
Association, New Orleans, February 1973.

²The cooperation of the Pasadena Public Schools is gratefully acknowledged.



STUDY 1 



282 students and 34 student teachers in an urban school partici-
pated. Students were in grades 4, 5 and 6. 48% of the students were black;
17% represented Spanish and other nonblack minority groups.



Instructional Materials

Three lessons were employed. Their titles and the statement of
the instructional task as given to the teacher follow.

Light Rays.               Given a diagram containing parallel light
                          rays, air, and glass, the student will
                          indicate whether the light rays will tend
                          to converge or tend to spread apart by
                          circling the correct term.

Adjectives.               Given sentences containing an underlined
                          adjective, the pupil will designate which
                          sentences [illegible] by circling a letter
                          placed before the sentence.

Folkways (short lesson).  When presented with a list of folkways,
                          the pupil will be able to designate which
                          folkways are mores and which are not.

Folkways (long lesson).   When presented with a list of folkways,
                          the pupil will be able to designate which
                          folkways are mores and which are not, and
                          which are primary and which are secondary.



Instruments 

A cognitive and an affective criterion test was constructed for
each lesson. The cognitive criterion test consisted of items consistent
with the instructional objective and similar to the sample questions
provided the student teacher. The affective tests for the two Folkways
lessons were identical and differed from the affective tests for the other
two lessons only in the reference to the name of the lesson. The affective
tests solicited each student's opinion (a) of the lesson and (b) of the
teacher. These two types of questions were combined into a total score
when two separate factors failed to appear.

A number of "control" measures were also available. These included
a test of liking for school (e.g., Do you like staying at home and watching
T.V. better than going to school?), a general vocabulary test, and separate
measures for each lesson designed to measure skills related to, but distinct
from, the cognitive skills to be taught. Also available for each child was
his sex, grade level, race, and a school rating of general ability expressed
on a 3-point scale.






Design

The variables were: Time for instruction (15 minutes, 30 minutes);
Class size (4-6 pupils, 8-12 pupils); Trial, i.e., attempt at teaching a
minilesson (1, 2, 3); and Minilesson (light rays, adjectives, folkways
short, folkways long). No child was taught the same lesson twice, nor (in
Study 1) did the student teacher teach the same lesson twice. The teacher
taught the same children for the first two trials and a different group of
children for a third trial. Assignment of students to groups was supposedly
on a stratified random basis (grade and academic level rating the
stratifying variables).
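The stratified random assignment described in the Design paragraph can be sketched as follows. This is only an illustration, not the procedure actually used in the schools: the function name, the roster, and the round-robin dealing rule are all assumptions introduced here.

```python
import random
from collections import defaultdict

def stratified_assign(students, n_groups, seed=0):
    """Deal students into n_groups so that each stratum is spread evenly.

    students: list of (student_id, stratum) pairs; a stratum might be a
    (grade, ability_rating) tuple. Returns a list of n_groups lists of ids.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for student_id, stratum in students:
        by_stratum[stratum].append(student_id)

    groups = [[] for _ in range(n_groups)]
    position = 0  # carried across strata so group sizes stay balanced
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        rng.shuffle(members)  # random order within the stratum
        for student_id in members:
            groups[position % n_groups].append(student_id)
            position += 1
    return groups

# Hypothetical roster: 24 pupils in grades 4-6 with ability ratings 1-3.
roster = [(i, (4 + i % 3, 1 + (i // 3) % 3)) for i in range(24)]
groups = stratified_assign(roster, n_groups=4)
```

Shuffling within each stratum supplies the randomness, while dealing the shuffled pupils out in turn keeps every group close to the same mix of grades and ability ratings.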

Calculation of Teaching Effectiveness Measure

The adjusted mean scores of students being taught a lesson (by a
student teacher) on the affective and cognitive criterion tests were the
teaching effectiveness scores for that teacher. The mean scores were adjusted
(using analysis of covariance) for initial differences on control measures and
for the unreliability of the control measures. Based on preliminary multiple
regression analyses in which the control measures were used to predict the
criterion measures, only one to three control measures were used to adjust
any one of the criterion measures. These effectiveness measures were
standardized to a mean of 50 and standard deviation of 10.
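A minimal sketch of the two steps named above, covariance adjustment of class means and rescaling to a mean of 50 and standard deviation of 10, is given here under simplifying assumptions. It uses a single control measure and the pooled within-class regression slope, and it omits the paper's correction for unreliability of the control measures. The function names and the data are hypothetical.

```python
from statistics import mean, stdev

def adjusted_class_means(classes):
    """classes: list of classes, each a list of (control, criterion) pairs.

    Returns one covariance-adjusted criterion mean per class, using the
    pooled within-class slope of criterion on control (one covariate only).
    """
    x_means = [mean(x for x, _ in c) for c in classes]
    y_means = [mean(y for _, y in c) for c in classes]
    sxy = sum((x - xm) * (y - ym)
              for c, xm, ym in zip(classes, x_means, y_means) for x, y in c)
    sxx = sum((x - xm) ** 2
              for c, xm in zip(classes, x_means) for x, _ in c)
    slope = sxy / sxx
    grand_x = mean(x for c in classes for x, _ in c)
    # Move each class mean to where it would fall at the grand control mean.
    return [ym - slope * (xm - grand_x) for xm, ym in zip(x_means, y_means)]

def to_t_scores(values):
    """Rescale values to mean 50 and (sample) standard deviation 10."""
    m, s = mean(values), stdev(values)
    return [50 + 10 * (v - m) / s for v in values]

# Hypothetical data: three classes of (control score, criterion score).
classes = [
    [(3, 5), (4, 6), (5, 8)],
    [(6, 7), (7, 9), (8, 10)],
    [(4, 4), (5, 6), (6, 7)],
]
adjusted = adjusted_class_means(classes)
scores = to_t_scores(adjusted)
```

In this toy example the second class has the highest raw criterion mean but also the highest control scores, so after adjustment it no longer ranks first.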

Results

A. Criterion Test Reliabilities and Multiple R with Predictors

                         Cognitive           Affective
Lesson                   KR20     R          KR20     R
Light Rays               .47     .16         .58     .20
Adjectives               .41     .45         .54     .15
Folkways (Short)         .72     .35         .71     .28
Folkways (Long)          [see note]

Note: only two values (.56 and .56) are legible in the Folkways (Long) row
of the source copy, and their column positions are uncertain.

Since a teacher's effectiveness score was the mean score on the test
of students being taught by him, the functioning reliability of the dependent
variable is higher than the values shown. This is because group means are,
in general, more reliable than an individual's score.

The low values of R suggest that the control measures were not very
effective in equating pupil differences existing prior to the instruction.

B. Test-Retest Teacher Effectiveness Reliability Data

Since teachers taught more than one lesson and since students were
taught more than once, it was possible to correlate effectiveness scores over
trials. Specifically, the (a) row in the table below indicates that a teacher's
effectiveness score on the cognitive criterion when she taught one lesson was
essentially uncorrelated (.02) with her corresponding score based on another
minilesson taught by her to the same group. Although 24 paired observations
were involved in the calculation, correlations were computed separately for
each pair of minilessons and, consequently, the degrees of freedom associated
with the mean r of .02 is only 18.
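The loss of degrees of freedom noted above follows from computing a separate correlation within each pair of minilessons: each correlation based on n teachers contributes n - 2 degrees of freedom. The sketch below assumes, purely for illustration, three pairs of minilessons with eight paired observations each, which is one arrangement consistent with 24 observations and 18 degrees of freedom. The scores are invented, and the unweighted averaging of the correlations is also an assumption, since the paper does not say how the mean r was formed.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def pooled_summary(subgroups):
    """subgroups: list of (xs, ys), one per pair of minilessons.

    Returns (unweighted mean r, total N, pooled degrees of freedom).
    """
    rs = [pearson_r(xs, ys) for xs, ys in subgroups]
    n_total = sum(len(xs) for xs, _ in subgroups)
    df = sum(len(xs) - 2 for xs, _ in subgroups)  # n - 2 per correlation
    return sum(rs) / len(rs), n_total, df

# Hypothetical effectiveness scores for three pairs of minilessons.
first = [48, 52, 45, 60, 55, 41, 50, 49]
subgroups = [
    (first, [51, 47, 58, 44, 50, 53, 49, 56]),
    (first, [46, 55, 49, 52, 43, 57, 50, 48]),
    (first, [50, 50, 47, 53, 58, 45, 52, 44]),
]
mean_r, n_total, df = pooled_summary(subgroups)
# n_total is 24 and df is 18, matching the counts reported in the text
```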

                                      Cognitive             Affective
                                      Criterion Tests       Criterion Tests
                                      N     df    r         N     df    r

(a) Both lessons taught to the
    SAME pupils (same teachers)       24    18    .02       --    16    .34

(b) Both lessons taught to
    DIFFERENT pupils (same
    teachers)                         38    20   -.03       41    32    .31

(c) Both lessons taught to the
    SAME pupils (different
    teachers)                         33    23    .51       35    25    --

-- = illegible in the source copy.



The correlations in row (a) were predicted to be the highest; those
in row (c) the lowest. The values of r in row (a) and, more especially, in row
(c) suggest that differences in pupil ability prior to instruction were not
controlled. The value of .02, however, is perplexing.



C. Analysis by Teaching Trial

The effectiveness scores analyzed by trial are shown in the table
below. Data are for the 21 teachers who taught all three minilessons.



              Cognitive                  Affective
Trial         N      Mean    S.D.        N      Mean    S.D.
First         21     --      11.5        21     48.9    --
Second        --     --      7.9         21     50.6    10.2
Third         19*    50.3    10.9        21     --      11.6

-- = illegible in the source copy.

*Students in two groups were not administered control tests for the cognitive
criterion test so adjusted teaching effectiveness scores could not be computed.

Teaching trial had a negligible effect in this study.



D. Analysis by Length of Teaching Time

The design permitted a comparison between effectiveness scores when
teachers were given a short time to teach (15 minutes) and scores when they
were given a longer teaching time (30 minutes). Since a different criterion
test was employed for the Folkways lesson under the two time conditions, the
short-time/long-time comparison is possible only for the other two lessons.
Results are below.







              Cognitive Criterion                  Affective Criterion
              Short             Longer             Short             Longer
Lesson        N   Mean  S.D.    N   Mean  S.D.     N   Mean  S.D.    N   Mean  S.D.
Light Rays    10  47.6  6.9     17  51.4  11.3     10  47.6  11.4    17  51.5  8.8
Adjectives    21  50.1  10.2     8  50.0   9.4     21  49.7  10.5     8  50.9  8.3
Total         31  49.3  9.3     25  51.0  10.7     31  49.0  10.9    25  51.3  8.6



The extra teaching time did not result in increased effectiveness
scores of teachers for the adjectives lesson. There was about .4 of a standard
deviation difference on the light rays lesson. This difference was not
statistically significant.



E. Analysis by "Class" Size

The design permitted a comparison between effectiveness scores when
teachers taught larger groups (usually 8-12) and scores when teachers taught
smaller groups (usually 4-6). Results are shown below.



              Cognitive Criterion                  Affective Criterion
              Small Class       Larger Class       Small Class       Larger Class
Lesson        N   Mean  S.D.    N   Mean  S.D.     N   Mean  S.D.    N   Mean  S.D.
Light Rays    13  45.0  5.9     14  53.9  11.4     13  49.9  11.4    14  50.1  8.5
Adjectives    14  46.9  11.9    15  53.0  6.6      14  51.0  8.1     15  49.0  11.3
Folkways      10  49.2  12.1    11  50.8  7.6      11  53.9  9.1     12  46.2  9.3
Total         37  47.1  10.4    40  52.7  8.9      38  51.5  9.8     41  48.6  10.0



The larger class size was associated with higher performance on the
cognitive tests (about one-half of a standard deviation), but with somewhat
lower ratings on the affective scales. For all lessons combined, the difference
of 5.6 points on the cognitive criterion was significant (p<.05); the corres-
ponding difference of -2.9 points on the affective criterion measure was not
significant.
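The paper does not say which significance test was applied to the class size comparison. As a rough check, the sketch below assumes an ordinary equal-variance two-sample t test computed from the "Total" row of the table; the function name is hypothetical, and the critical value of roughly 1.99 (two-tailed, .05 level, about 75 degrees of freedom) is taken from standard t tables.

```python
from math import sqrt

def pooled_t(n1, m1, s1, n2, m2, s2):
    """Equal-variance two-sample t from summary statistics.

    Returns (t, df), with t computed as (m2 - m1) / standard error.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m2 - m1) / se, df

# "Total" row: small class first, larger class second.
t_cog, df_cog = pooled_t(37, 47.1, 10.4, 40, 52.7, 8.9)   # t is about 2.54
t_aff, df_aff = pooled_t(38, 51.5, 9.8, 41, 48.6, 10.0)   # t is about -1.30

CRITICAL_T = 1.99  # approximate two-tailed .05 value near 75 df
cognitive_significant = abs(t_cog) > CRITICAL_T
affective_significant = abs(t_aff) > CRITICAL_T
```

Under these assumptions the cognitive difference clears the .05 level and the affective difference does not, which agrees with the conclusions stated in the text.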



F. Analysis by Familiarity with Content

Minilesson instructors were given the following question: "Pretend
you were given a test with items like the sample test items but were not
allowed to read the explanatory material. How well do you think you would do
on such a test? (a) Poorly. I'd have to guess at the items. (b) O.K. I would
know some and miss some. (c) Well. I was familiar enough with the ideas ahead
of time that I would probably have answered virtually all the items correctly."

The data in the table below do NOT support the hypothesis that
instructors who rate themselves as knowledgeable about the content they teach
will have higher effectiveness scores than instructors who do not rate
themselves as high.





                         Cognitive Criterion      Affective Criterion
                         N     Mean    S.D.       N     Mean    S.D.
(a) would do POORLY      27    48.4    10.5       27    50.4    9.5
(b) would do O.K.        32*   51.4    10.0       33    50.1    9.8
(c) would do WELL        --    50.1    7.0        19    49.1    10.9

-- = illegible in the source copy.



*Students in two groups were not administered control tests for the cognitive
criterion test so adjusted teaching effectiveness scores could not be
calculated.

G. Analysis of Effect of Instruction on CONTROL Test Performance

The control tests were administered to students after they had
received instruction. This was done for administrative convenience, as
students only had to be tested in one sitting (for each minilesson). However,
this can be a "dangerous" practice, because the instruction might change the
test performance and the very effects one wants to measure might be
"controlled away." To see if control test performance was influenced by
instruction, 4 or 5 students from each of 9 classrooms were given two control
tests prior to any instruction. The two tests chosen were the control test on
attitudes toward school and the control test for the folkways lesson, the
control test asking students which of a number of practices were laws and
which were not laws. Results are shown below.





                        Attitude Control Test      Folkway Control Test
Time of Testing         N     Mean    S.D.*        N      Mean    S.D.*
Before Instruction      38    4.84    1.26         19**   7.95    --
After Instruction       38    4.39                 19**   5.95

-- = illegible in the source copy (the digits 97 survive without a legible
decimal point).

*Of the difference scores.

**There are fewer cases here because only half of the students taking this
control test prior to instruction were taught the folkways minilesson.

The differences are of statistical and practical significance.
Although these two control tests were the ones thought most likely to show
the effect being studied, and although the change of scores does not mean
that the several teachers would be affected differentially, the results do
suggest that caution be exercised if this tactic of administrative convenience
is to be used.
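For the attitude control test, the claim of statistical significance can be checked from the tabled summary values, assuming a paired (dependent-samples) t test on the before-minus-after difference scores. The test actually used is not named in the paper, so this is an assumption, as are the function name and the critical value of roughly 2.03 (two-tailed, .05 level, 37 degrees of freedom) taken from standard t tables.

```python
from math import sqrt

def paired_t_from_summary(mean_before, mean_after, sd_diff, n):
    """Paired t statistic from two means and the S.D. of difference scores.

    Returns (t, df). The mean difference equals the difference of the means
    because the same students were tested on both occasions.
    """
    mean_diff = mean_before - mean_after
    se = sd_diff / sqrt(n)
    return mean_diff / se, n - 1

# Attitude control test values from the table above.
t_att, df_att = paired_t_from_summary(4.84, 4.39, 1.26, 38)  # t is about 2.20

CRITICAL_T = 2.03  # approximate two-tailed .05 value for 37 df
attitude_shift_significant = abs(t_att) > CRITICAL_T
```

The same calculation cannot be repeated for the folkway control test because the standard deviation of its difference scores is not legible in the source copy.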



STUDY 2 



155 students and 18 student teachers in an urban school. 144 students
were in grade 3; the rest were in grade 2. 34% of the students were Black;
18% represented Spanish and other nonblack minority groups.



Instructional Materials

Two "minilessons" were employed. The titles and the statement of the
instructional task as given to the teachers follow.

Decoding.  Given a list of "words" containing symbols
           that have a sound and symbols that do not
           have a sound, the children will be able to
           circle the "words" which have a SHORT sound.
           (Using nonsense symbols, the task was similar
           to that for determining long and short vowel
           sounds in printed words that follow
           consonant-vowel-consonant and consonant-vowel
           patterns.)

Rhythms.   Given an orally presented sentence or verse having
           either an iambic or dactylic rhythm, the children
           will be able to distinguish between the meters by
           circling a picture of either a balloon or a
           bumblebee.



Instruments 



A cognitive and an affective criterion test was constructed for each
lesson. The cognitive criterion test consisted of items consistent with the
instructional objective and similar to the sample questions provided the
student teacher. The affective tests for the two lessons were identical except
when referencing the name of the lesson. The affective tests solicited each
student's opinion (a) of the lesson and (b) of the teacher. These two types
of questions were combined into a total score when two separate factors failed
to appear.

"Control" measures were also available. These included a test of
liking for school (e.g., Do you like staying at home and watching T.V. better
than going to school?), and separate measures for each lesson designed to
measure skills related to, but distinct from, the cognitive skill being
taught. Also available for each child was his sex, race, and a 3-point
rating of general ability made by school personnel.



Design

The plan was for each teacher to teach three times: one lesson
twice and the other lesson once. Further, each teacher was to teach the
same group twice and another group once. Only 12 of the 18 teachers
completed all three lessons.

Assignment of students to groups was on a stratified random basis
with academic level rating the stratifying variable. Group size was limited
to [illegible], with the students in any one group coming from different
homerooms. Kept constant was instructional time (15 minutes).

Calculation of Teaching Effectiveness Measures

The same procedure as employed in Study 1 was used. This involved
computing the mean score of students in a group on the criterion measures and
adjusting these means for initial group differences on the "control" measures.



Results 



A. Criterion Test Reliabilities and Multiple R with Predictors

                 Cognitive           Affective
Lesson           KR20     R          KR20     R
Decoding         .38     .76         .50     .16
Rhythms          .18     --          .43     .18

-- = illegible in the source copy.

The test reliabilities are barely satisfactory and the multiple
correlations predicting the affective criteria are unsatisfactory. The multiple
correlations predicting the cognitive criteria are surprisingly high given the
KR20 values of the dependent variables.
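For readers unfamiliar with the KR20 index reported above, the following sketch computes the Kuder-Richardson formula 20 from a matrix of right/wrong item scores. The data are invented, and the use of the population (divide-by-n) variance of total scores is a common convention assumed here; the paper does not state how its values were computed.

```python
def kr20(item_matrix):
    """Kuder-Richardson formula 20 for dichotomous (0/1) items.

    item_matrix: list of examinee rows, each a list of 0/1 item scores.
    Uses the population variance of total scores.
    """
    n = len(item_matrix)
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # Sum over items of p * q, the variance of each 0/1 item.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n
        sum_pq += p * (1 - p)
    return k / (k - 1) * (1 - sum_pq / var_total)

# Hypothetical responses: five pupils, four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(responses)  # 0.8 for this tidy example
```

Short tests with few items, such as criterion tests written for a 15-minute minilesson, tend to yield low values of this index.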



B. Test-Retest Teacher Effectiveness Reliability

Because of the design used, it was possible to correlate effectiveness
scores under three conditions: when the teacher taught the same lesson both
times (to different groups of students, of course); when different lessons were
taught the two times, but the same group of students were used; and finally
when different lessons and different students were involved in the two teaching
trials. The correlations shown below are actually means of several correlations
each computed on a small number of teachers having similar teaching experiences.
For example, a separate correlation was computed for the three teachers who
taught the decoding lesson both times, their first and second teaching attempts.

                                         Cognitive            Affective
                                         Criterion Tests      Criterion Tests
                                         N    df    r         N     df    r
Same lesson, different students          12   6    -.23       11*   5     .26
Different lessons, same students         12   6     .71       12    6     .18
Different lessons, different students    12   8     .09       --    7    -.22

-- = illegible in the source copy.

*In one group of students, the control affective measure was not administered,
and thus a teaching effectiveness score could not be computed.






C. Analysis by Teaching Trial 

The data are based on the 12 teachers who taught the two minilessons
a total of three times. One teacher whose group did not take a "control" test
in the affective area was excluded.







                                  Cognitive               Affective
                                  Criterion Tests         Criterion Tests
                                  N     Mean    S.D.      N     Mean    S.D.

A teacher's first trial           --    --      6.7       11    50.8    9.1
A teacher's second trial          12    50.1    11.1      11    --      --

The 1st time a teacher
  taught a minilesson             12    47.3    6.7       11    --      9.1
The 2nd time the teacher
  taught the SAME minilesson      --    --      --        --    --      --

1st time a group of
  students taught                 12    47.8    9.6       11    55.3    7.6
2nd time the same group
  of students taught              12    51.5    10.9      --    --      8.6

-- = illegible in the source copy.



The experience of previously teaching the same minilesson or the
same group is associated with higher effectiveness scores on the cognitive
measures. Familiarity with the students was associated with lower ratings
on the affective criterion, an observation noted previously by John McNeil.³
Although suggestive, none of the differences is statistically significant at
p<.05, two-tailed.



D. Analysis by Familiarity of Content

As was done in Study 1, instructors were asked how well they would
do on the posttest type of questions if they had not been given background
information about the subject. On the decoding lesson, the posttest consists
of "words" which use "nonsense" type symbols, and all teachers would have to
guess on such questions if the arbitrary key was not given.

RESULTS FOR RHYTHM TEST ONLY

                        Cognitive                Affective
                        Criterion Test           Criterion Test
                        N     Mean    S.D.       N     Mean    S.D.
(a) Would do poorly     --    --      --         --    --      --
(b) Would do O.K.       --    --      --         --    --      --
(c) Would do well       --    --      --         --    --      --

-- = illegible in the source copy.

Note: scores reported above do not have means and S.D.s of 50 and 10 because
the instructor's second trial results are not shown.

Again, these ratings do not seem particularly related to the teacher's
effectiveness score.



³Personal communication.



STUDY 3 

Subjects

A convenience sample of 58 teachers from Long Beach, California
and Hawaii. Students were elementary-aged children.

Instructional Materials and Instruments

The light rays, folkways (long), rhythms, and decoding lessons
and criterion tests (see Studies 1 and 2 above) were used. In many cases,
control tests were not administered.

Design

No planned design was followed. In a few cases, teachers taught
more than one minilesson with relatively little confounding between order
and minilesson. Data are reported only for teachers instructing with more than
one minilesson. Class sizes ranged from four to six with the vast majority
of them being six in number. The method by which students were assigned to
teachers is not known.

Calculation of Teaching Effectiveness Measures

When control test data were available and correlated at least .15
with criterion tests, mean class scores were adjusted in accordance with the
procedures described for Study 1.

Results 

A. Test-Retest Teacher Effectiveness Reliability

     Cognitive                 Affective
     Criterion Tests           Criterion Tests
     N     d.f.    r           N     d.f.    r
     13    9       .05         12    8       -.22

Effectiveness scores not adjusted.

The standard scores reported do not have a mean of 50 (and S.D. = 10) because
the subsample on which multiple minilesson data were available is not exactly
representative of the total sample on which standardization of scores was
based.






B. Analysis by Teaching Trial

             Cognitive                 Affective
             Criterion Tests           Criterion Tests
Trial        N     Mean    S.D.        N     Mean    S.D.
First        17    --      10.9        18    47.3    6.2
Second       17    53.6    8.0         18    --      10.5

-- = illegible in the source copy.

The tendency found in Study 2 of improved effectiveness scores on
the cognitive criterion measures with practice was evident here also
(t = 2.62 with 16 d.f., .01<p<.02).

CONCLUDING THOUGHTS

To most teaching performance test advocates, the most disturbing
results reported in these studies are the erratic and low test-retest
reliabilities. Although one reviewer of previous data on this question concluded
that results from such studies were not inconsistent with a hypothesis of
zero reliability,⁶ more encouraging findings were reported in one of the
larger, better designed studies.⁷

Why the low reliabilities? Several possible answers come to mind.

Reason: The performance test results may be reflecting the true
state of affairs; teaching is not a generalizable act. Reply: It is no doubt
the case that, like happiness, anxiety and other traits that vary over
occasions, a teacher's "real" effectiveness varies with the situation.
Nevertheless, one could have hoped to obtain a more interpretable
pattern of correlations in which reliabilities would be highest when the
situations (e.g., minilesson being taught, student group involved) were most
similar.

Reason: The abilities and attitudes of the learners are what really
make the difference; the teacher's input accounts for a negligible amount of
the variance. Reply: We obviously were not able to control adequately the
pretreatment individual differences variables, but even when the same students
were involved in the test-retest correlations, results were erratic.

Reason: The conduct of the investigation during the data gathering
phases was faulty. Reply: There is a risk involved when a study is being
carried out 2,500 miles away from the principal investigator. Except for
Study 3, which was never intended to be a controlled investigation, to the
best of my knowledge the data were collected in accordance with the planned
design.



⁶Glass, Gene V. Statistical and measurement problems in implementing the
Stull Act. Stanford, California: Stanford University Invitational
Conference on the Stull Act, 1972.

⁷Belgard, M., et al. Pages 182-209 in Westbury and Bellack (Eds.), Research
into classroom processes: recent developments and next steps. New York:
Teachers College Press, 1971.






Reason: The analysis lacked power in the statistical sense of
the word. Reply: Clearly the erratic results are due in large measure
to the few teachers involved in most comparisons. For example, the 95%
confidence interval for a correlation of 0 computed on 8 teachers exceeds
±.60.
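The width of that interval can be reproduced approximately with the Fisher z transformation, a standard large-sample method. The paper does not say how its figure was obtained, so the method, the function name, and the normal critical value of 1.96 are assumptions made for this sketch.

```python
from math import atanh, sqrt, tanh

def r_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation via Fisher's z.

    The standard error of z = atanh(r) is 1 / sqrt(n - 3).
    """
    z = atanh(r)
    half_width = z_crit / sqrt(n - 3)
    return tanh(z - half_width), tanh(z + half_width)

low, high = r_confidence_interval(0.0, 8)
# The interval runs from about -0.70 to about +0.70, wider than +/-.60.
```

With so few teachers, even a sample correlation as large as .60 would not be distinguishable from zero.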

Reason: The control and criterion instruments were not that
reliable. Reply: It may be that with more reliable measures utilizing
more items, collected on larger student groups after longer instructional
sessions, the teaching performance tests will be a more reliable
indicator of teaching effectiveness.
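How much reliability might be gained from "more items" can be gauged with the Spearman-Brown formula, a standard psychometric result that the paper itself does not cite. The sketch below applies it to the cognitive KR20 of .38 reported for the decoding test in Study 2; it assumes the added items would behave like the existing ones, and the function name is hypothetical.

```python
def spearman_brown(reliability, factor):
    """Predicted reliability when test length is multiplied by `factor`.

    Assumes the added items are parallel to the existing ones.
    """
    return factor * reliability / (1 + (factor - 1) * reliability)

doubled = spearman_brown(0.38, 2)  # about 0.55
tripled = spearman_brown(0.38, 3)  # about 0.65
```

On this reckoning, even tripling the length of such a test would leave its reliability well below the levels usually sought for individual measurement.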

Clearly more definitive work is needed on teaching performance
tests. It is to this end that my future empirical work is directed.




